Lab 3: Modeling correlation and regression

Practice session covering topics discussed in Lecture 3

M. Chiara Mimmi, Ph.D. | Università degli Studi di Pavia

July 27, 2024

GOAL OF TODAY’S PRACTICE SESSION

Understand …..

The examples and code from this lab session follow very closely the open book:
+ Vu, J., & Harrington, D. (2021). Introductory Statistics for the Life and Biomedical Sciences. https://www.openintro.org/book/biostat/

Topics discussed in Lecture # 3

Lecture 3: topics

  • Testing for a correlation hypothesis (relationship of variables)
    • Pearson rho analysis (parametric)
    • Spearman test (non-parametric)
  • Measures of association
    • Fisher’s Exact Test
    • Chi-Square Test of Independence
  • From correlation/association to causation
    • introduction to experiments
      • Example: Linear regression models
      • Example: Multiple Linear Regression
  • From causation to prediction
    • introduction to Machine Learning
      • Supervised algorithms
      • Unsupervised algorithms

R ENVIRONMENT SET UP & DATA

Needed R Packages

  • We will use functions from packages base, utils, and stats (pre-installed and pre-loaded)
  • We will also use the packages below (specifying package::function for clarity).
# Load them for this R session

# General 
library(fs)      # file/directory interactions
library(here)    # tools to find your project's files, based on the working directory
library(janitor) # tools for examining and cleaning data
library(dplyr)   # {tidyverse} tools for manipulating and summarizing tidy data 
library(forcats) # {tidyverse} tool for handling factors
library(openxlsx) # Read, Write and Edit xlsx Files
library(flextable) # Functions for Tabular Reporting

# Statistics
library(rstatix) # Pipe-Friendly Framework for Basic Statistical Tests
library(lmtest) # Testing Linear Regression Models
library(broom) # Convert Statistical Objects into Tidy Tibbles
# library(tidymodels) # Tidy modeling framework (not installed on this machine)
# Plotting
library(ggplot2) # Create Elegant Data Visualisations Using the Grammar of Graphics
library(ggpubr) # 'ggplot2' Based Publication Ready Plots
# load some colors
colors <- readRDS(here::here("practice", "data_input", "03_datasets","colors.rds"))

DATASETS for today

Importing Dataset 1 (NHANES)

  • Adapt the here() calls below to match your own folder structure

Name: NHANES (National Health and Nutrition Examination Survey) combines interviews and physical examinations to assess the health and nutritional status of adults and children in the United States. Started in the 1960s, it became a continuous program in 1999. Documentation: dataset1
Sampling details: Here we use a sample of 500 adults from NHANES 2009-2010 & 2011-2012 (nhanes.samp.adult.500 in the R oibiostat package, which has been adjusted so that it can be viewed as a random sample of the US population)

# Check my working directory location
# here::here()

# Use `here` in specifying all the subfolders AFTER the working directory 
nhanes_samp <- read.csv(file = here::here("practice", "data_input", "03_datasets",
                                      "nhanes_samp.csv"), 
                          header = TRUE, # 1st line is the name of the variables
                          sep = ",", # which is the field separator character.
                          na.strings = c("?","NA" ), # specific MISSING values  
                          row.names = NULL) 

NHANES Variables and their description

Variable Type Description
X int xxxx
ID int xxxxx
SurveyYr chr yyyy_mm. Ex. 2011_12
Gender chr Gender (sex) of study participant coded as male or female
Age int ##
AgeDecade chr Age bracket, yy-yy. Ex. 20-29
Race1 chr Reported race of study participant: Mexican, Hispanic, White, Black, or Other
Race3 chr Reported race of study participant... Not available for 2009-10
Education chr [>= 20 yro]. Ex. 8thGrade, 9-11thGrade, HighSchool, SomeCollege, or CollegeGrad.
MaritalStatus chr [>= 20 yro]. Ex. Married, Widowed, Divorced, Separated, NeverMarried, or LivePartner
HHIncome chr Total annual gross income for the household in US dollars
HHIncomeMid int Numerical version of HHIncome derived from the middle income in each category. Ex. 12500 40000
Poverty dbl A ratio of family income to poverty guidelines. Smaller numbers indicate more poverty. Ex. 0.95 1.74 4.99
HomeRooms int How many rooms are in home of study participant (counting kitchen but not bath room).
HomeOwn chr One of Home, Rent, or Other
Work chr One of NotWorking or Working
Weight dbl Weight in kg
Height dbl Standing height in cm. Reported for participants aged 2 years or older.
BMI dbl Body mass index (weight/height2 in kg/m2). Reported for participants aged 2 years or older
Pulse int 60 second pulse rate
BPSysAve int Combined systolic blood pressure reading, following the procedure outlined for BPXSAR
BPDiaAve int Combined diastolic blood pressure reading, following the procedure outlined for BPXDAR
BPSys1 int Systolic blood pressure in mm Hg – first reading
BPDia1 int Diastolic blood pressure in mm Hg – first reading (consecutive readings)
BPSys2 int Systolic blood pressure in mm Hg – second reading (consecutive readings)
BPDia2 int Diastolic blood pressure in mm Hg – second reading
BPSys3 int Systolic blood pressure in mm Hg – third reading (consecutive readings)
BPDia3 int Diastolic blood pressure in mm Hg – third reading (consecutive readings)
Testosterone dbl Testosterone total (ng/dL). Reported for participants aged 6 years or older. Not available for 2009-2010
DirectChol dbl Direct HDL cholesterol in mmol/L. Reported for participants aged 6 years or older
TotChol dbl Total cholesterol in mmol/L. Reported for participants aged 6 years or older
UrineVol1 int Urine volume in mL – first test. Reported for participants aged 6 years or older
UrineFlow1 dbl Urine flow rate in mL/min – first test. Reported for participants aged 6 years or older
UrineVol2 int Urine volume in mL – second test
UrineFlow2 dbl Urine flow rate (urine volume/time since last urination) in mL/min – second test
Diabetes chr Study participant told by a doctor or health professional that they have diabetes
DiabetesAge int Age of study participant when first told they had diabetes
HealthGen chr Self-reported rating of health: Excellent, Vgood, Good, Fair, or Poor
DaysPhysHlthBad int Self-reported # of days participant’s physical health was not good out of the past 30 days
DaysMentHlthBad int Self-reported # of days participant’s mental health was not good out of the past 30 days
LittleInterest chr Self-reported # of days where participant had little interest in doing things. Among: None, Several, Majority, or AlmostAll
Depressed chr Self-reported # of days where participant felt down, depressed or hopeless. Among: None, Several, Majority, or AlmostAll
nPregnancies int # times participant has been pregnant
nBabies int # deliveries resulted in live births
PregnantNow chr Pregnancy status ascertained for females 8-59 years of age
Age1stBaby int Age of participant at time of first live birth
SleepHrsNight int Self-reported # of hours study participant gets at night on weekdays or workdays. For participants aged 16 years and older
SleepTrouble chr Participant [16 years and older] has had trouble sleeping. Coded as Yes or No.
PhysActive chr Participant does moderate or vigorous-intensity sports, fitness or recreational activities (Yes or No).
PhysActiveDays int Number of days in a typical week that participant does moderate or vigorous intensity activity.
TVHrsDay chr Number of hours per day on average participant watched TV over the past 30 days.
CompHrsDay chr Number of hours per day on average participant used a computer or gaming device over the past 30 days.
TVHrsDayChild lgl [2-11 yro] Number of hours per day on average participant watched TV over the past 30 days.
CompHrsDayChild lgl [2-11 yro] Number of hours per day on average participant used a computer or gaming device over the past 30 days.
Alcohol12PlusYr chr Participant has consumed at least 12 drinks of any type of alcoholic beverage in any one year
AlcoholDay int Average number of drinks consumed on days that participant drank alcoholic beverages
AlcoholYear int [>= 18 yro] Estimated number of days over the past year that participant drank alcoholic beverages
SmokeNow chr Study participant currently smokes cigarettes regularly. (Yes or No)
Smoke100 chr Study participant has smoked at least 100 cigarettes in their entire life. (Yes or No)
Smoke100n chr Coded as Smoker or Non-Smoker
SmokeAge int Age study participant first started to smoke cigarettes fairly regularly
Marijuana chr Participant has tried marijuana
AgeFirstMarij int Age participant first tried marijuana
RegularMarij chr Participant has been/is a regular marijuana user (used at least once a month for a year) (Yes or No)
AgeRegMarij int Age of participant when first started regularly using marijuana
HardDrugs chr Participant has tried cocaine, crack cocaine, heroin or methamphetamine (Yes or No)
SexEver chr Participant has ever had sex (Yes or No)
SexAge int Age at which participant first had sex
SexNumPartnLife int Number of opposite sex partners participant has had
SexNumPartYear int Number of opposite sex partners over the past 12 months
SameSex chr Participant has had any kind of sex with a same-sex partner (Yes or No)
SexOrientation chr Participant’s sexual orientation: one of Heterosexual, Homosexual, Bisexual

Importing Dataset 2 (PREVEND)

Data from the Prevention of Renal and Vascular End-stage Disease (PREVEND) study, which took place in the Netherlands. The study collected various demographic and cardiovascular risk factors. This dataset is from the third survey, which participants completed in 2003-2006; data is provided for 4,095 individuals who completed cognitive testing.

Name: PREVEND (Prevention of REnal and Vascular END-stage Disease) is a study which took place in the Netherlands on 8,592 participants aged 28-75, first surveyed in 1997-1998. A second survey was conducted in 2001-2003 on 6,894 participants, and 5,862 completed the third survey in 2003-2006 (where measurement of cognitive function was added to the study protocol).
Documentation: dataset2
Sampling details: Data from 4,095 individuals who completed cognitive testing are in the prevend dataset, available in the R package oibiostat.

# Check my working directory location
# here::here()

# Use `here` in specifying all the subfolders AFTER the working directory 
prevend <- read.csv(file = here::here("practice", "data_input", "03_datasets",
                                      "prevend.csv"), 
                          header = TRUE, # 1st line is the name of the variables
                          sep = ",", # which is the field separator character.
                          na.strings = c("?","NA" ), # specific MISSING values  
                          row.names = NULL) 

PREVEND Variables and their description

Variable Description
... ...

Importing Dataset 3 (FAMuSS)

Name: FAMuSS (Functional SNPs Associated with Muscle Size and Strength) examined the association of demographic, physiological and genetic characteristics with muscle strength – including data on race and genotype at a specific locus on the ACTN3 gene (the “sports gene”).
Documentation: dataset3
Sampling details: the DATASET includes 595 observations on 9 variables

# Check my working directory location
# here::here()

# Use `here` in specifying all the subfolders AFTER the working directory 
famuss <- read.csv(file = here::here("practice", "data_input", "03_datasets",
                                      "famuss.csv"), 
                          header = TRUE, # 1st line is the name of the variables
                          sep = ",", # which is the field separator character.
                          na.strings = c("?","NA" ), # specific MISSING values  
                          row.names = NULL) 

FAMuSS Variables and their description

Variable Description
X id
ndrm.ch Percent change in strength in the non-dominant arm
drm.ch Percent change in strength in the dominant arm
sex Sex of the participant
age Age in years
race Recorded as African Am (African American), Caucasian, Asian, Hispanic, Other
height Height in inches
weight Weight in pounds
actn3.r577x Genotype at the location r577x in the ACTN3 gene.
bmi Body Mass Index

CORRELATION

Explore relationships between two variables

Approaches for summarizing relationships between two variables vary depending on variable types…

  • Two numerical variables
  • Two categorical variables
  • One numerical variable and one categorical variable

Two variables \(x\) and \(y\) are

  • positively associated if \(y\) increases as \(x\) increases.
  • negatively associated if \(y\) decreases as \(x\) increases.
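As a minimal sketch (toy vectors, hypothetical data, not from today's datasets), the sign of the correlation coefficient reflects the direction of the association:

```r
# Toy vectors (hypothetical) illustrating the sign of an association
x     <- c(1, 2, 3, 4, 5)
y_pos <- c(2, 4, 5, 4, 6)  # tends to increase with x
y_neg <- c(9, 7, 6, 5, 2)  # tends to decrease with x

cor(x, y_pos)  # positive coefficient
cor(x, y_neg)  # negative coefficient
```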

TWO NUMERICAL VARIABLES (NHANES)

Two numerical variables (plot)

Height and weight (taken from the nhanes_samp dataset) are positively associated.

  • notice we can also use the generic function base::plot for a simple scatter plot
# rename for convenience
nhanes <- nhanes_samp %>% 
  janitor::clean_names()

# basic plot 
plot(nhanes$height, nhanes$weight,
     xlab = "Height (cm)", ylab = "Weight (kg)", cex = 0.8)  

Two numerical variables: correlation (with stats::cor)

Correlation is a numerical summary that measures the strength of a linear relationship between two variables.

  • The correlation coefficient \(r\) takes on values between \(-1\) and \(1\).

  • The closer \(r\) is to \(\pm 1\), the stronger the linear association.

  • Here we compute the Pearson rho (parametric), with the function cor

    • notice the use argument to choose how to deal with missing values (in this case only using all complete pairs)
is.numeric(nhanes$height) 
[1] TRUE
is.numeric(nhanes$weight)
[1] TRUE
# using `stats` package
cor(x = nhanes$height, y =  nhanes$weight, 
    # argument for dealing with missing values
    use = "pairwise.complete.obs",
    method = "pearson")
[1] 0.4102269

Two numerical variables: correlation (with stats::cor.test)

  • Here we compute the Pearson rho (parametric), with the function cor.test (the same we used for testing paired samples)
    • it implicitly takes care of NAs
# using `stats` package 
cor_test_result <- cor.test(x = nhanes$height, y =  nhanes$weight, 
                            method = "pearson")

# looking at the cor estimate
cor_test_result[["estimate"]][["cor"]]
[1] 0.4102269
  • The function ggpubr::ggscatter gives us all in one (scatter plot + \(r\) (“R”))
library("ggpubr") # 'ggplot2' Based Publication Ready Plots
ggpubr::ggscatter(nhanes, x = "height", y = "weight", 
                  cor.coef = TRUE, cor.method = "pearson", #cor.coef.coord = 2,
                  xlab = "Height (cm)", ylab = "Weight (kg)")

Spearman rank-order correlation

The Spearman’s rank-order correlation is the nonparametric version of the Pearson correlation.

Spearman’s correlation coefficient (\(\rho\), also written \(r_s\)) measures the strength and direction of association between two ranked variables.

  • used when 2 variables have a monotonic, but not necessarily linear, relationship
  • excellent for ordinal data (when Pearson’s is not appropriate), e.g. Likert scale items

To compute it, we simply calculate Pearson’s correlation of the rankings of the raw data (instead of the data).
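A minimal check of this equivalence on simulated data (hypothetical, not one of today's datasets):

```r
set.seed(3)
x <- rnorm(50)
y <- x^3 + rnorm(50, sd = 0.5)  # monotonic but non-linear relationship

# Spearman directly...
rho_direct <- cor(x, y, method = "spearman")
# ...equals Pearson computed on the rankings
rho_ranks  <- cor(rank(x), rank(y), method = "pearson")

all.equal(rho_direct, rho_ranks)  # TRUE
```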

Spearman rank-order correlation (example)

Let’s say we want to get Spearman’s correlation with the ordinal factors Education and HealthGen in the NHANES sample. We have to convert them to their underlying numeric codes to compare rankings.

tabyl(nhanes$education)
 nhanes$education   n percent valid_percent
        8th Grade  32   0.064    0.06412826
   9 - 11th Grade  68   0.136    0.13627255
     College Grad 157   0.314    0.31462926
      High School  94   0.188    0.18837675
     Some College 148   0.296    0.29659319
             <NA>   1   0.002            NA
tabyl(nhanes$health_gen)
 nhanes$health_gen   n percent valid_percent
         Excellent  47   0.094    0.10444444
              Fair  53   0.106    0.11777778
              Good 177   0.354    0.39333333
              Poor  11   0.022    0.02444444
             Vgood 162   0.324    0.36000000
              <NA>  50   0.100            NA
nhanes <- nhanes %>% 
  # reorder education
  mutate (edu_ord = factor (education, 
                            levels = c("8th Grade", "9 - 11th Grade",
                                       "High School", "Some College",
                                       "College Grad" , NA))) %>%  
  # create edu_rank 
  mutate (edu_rank = as.numeric(edu_ord)) %>% 
  # reorder health education
  mutate (health_ord = factor (health_gen, 
                            levels = c( NA, "Poor", "Fair",
                                       "Good", "Vgood",
                                       "Excellent"))) %>%
  # create health_rank 
  mutate (health_rank = as.numeric(health_ord)) 

table(nhanes$edu_ord, useNA = "ifany" )

     8th Grade 9 - 11th Grade    High School   Some College   College Grad 
            32             68             94            148            157 
          <NA> 
             1 
table(nhanes$edu_rank, useNA = "ifany" )

   1    2    3    4    5 <NA> 
  32   68   94  148  157    1 
table(nhanes$health_ord, useNA = "ifany" )

     Poor      Fair      Good     Vgood Excellent      <NA> 
       11        53       177       162        47        50 
table(nhanes$health_rank,  useNA = "ifany" )

   1    2    3    4    5 <NA> 
  11   53  177  162   47   50 

Spearman rank-order correlation (example cont.)

Now we can actually compute it

  • After setting up the variables correctly (as.numeric), just specify method = "spearman"
# using `stats` package 
cor_test_result_sp <- cor.test(x = nhanes$edu_rank,
                               y = nhanes$health_rank, 
                               method = "spearman", 
                               exact = FALSE) # removes the Ties message warning 
# looking at the cor estimate
cor_test_result_sp

    Spearman's rank correlation rho

data:  nhanes$edu_rank and nhanes$health_rank
S = 10641203, p-value = 1.915e-10
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho 
0.2946493 
#cor_test_result_sp[["estimate"]][["rho"]]

TWO CATEGORICAL VARIABLES (FAMuSS)

Two categorical variables (plot)

In the famuss dataset, the variables race, and actn3.r577x are categorical variables.

## genotypes as columns
genotype.race = matrix(table(famuss$actn3.r577x, famuss$race), ncol=3, byrow=T)
colnames(genotype.race) = c("CC", "CT", "TT")
rownames(genotype.race) = c("African Am", "Asian", "Caucasian", "Hispanic", "Other")

barplot(genotype.race, col = colors[c(7, 4, 1, 2, 3)], ylim=c(0,300), width=2)
legend("topright", inset=c(.05, 0), fill=colors[c(7, 4, 1, 2, 3)], 
       legend=rownames(genotype.race))

Two categorical variables (contingency table)

Specifically, the variable actn3.r577x takes on three possible levels (CC, CT, or TT) which indicate the distribution of genotype at location r577x on the ACTN3 gene for the FAMuSS study participants.

A contingency table summarizes data for two categorical variables.

  • the function stats::addmargins puts arbitrary Margins on Multidimensional Tables
    • The extra column & row "Sum" provide the marginal totals across each row and each column, respectively
# levels of actn3.r577x
table(famuss$actn3.r577x)

 CC  CT  TT 
173 261 161 
# contingency table to summarize race and actn3.r577x
addmargins(table(famuss$race, famuss$actn3.r577x))
            
              CC  CT  TT Sum
  African Am  16   6   5  27
  Asian       21  18  16  55
  Caucasian  125 216 126 467
  Hispanic     4  10   9  23
  Other        7  11   5  23
  Sum        173 261 161 595

Two categorical variables (contingency table prop)

Contingency tables can also be converted to show proportions. Since there are 2 variables, it is necessary to specify whether the proportions are calculated according to the row variable or the column variable:

  • using the margin = argument in the base::prop.table function (1 indicates rows, 2 indicates columns)

# adding row proportions
addmargins(prop.table(table(famuss$race, famuss$actn3.r577x), margin =  1))
            
                    CC        CT        TT       Sum
  African Am 0.5925926 0.2222222 0.1851852 1.0000000
  Asian      0.3818182 0.3272727 0.2909091 1.0000000
  Caucasian  0.2676660 0.4625268 0.2698073 1.0000000
  Hispanic   0.1739130 0.4347826 0.3913043 1.0000000
  Other      0.3043478 0.4782609 0.2173913 1.0000000
  Sum        1.7203376 1.9250652 1.3545972 5.0000000
# adding column proportions
addmargins(prop.table(table(famuss$race, famuss$actn3.r577x),margin =  2))
            
                     CC         CT         TT        Sum
  African Am 0.09248555 0.02298851 0.03105590 0.14652996
  Asian      0.12138728 0.06896552 0.09937888 0.28973168
  Caucasian  0.72254335 0.82758621 0.78260870 2.33273826
  Hispanic   0.02312139 0.03831418 0.05590062 0.11733618
  Other      0.04046243 0.04214559 0.03105590 0.11366392
  Sum        1.00000000 1.00000000 1.00000000 3.00000000

Chi Squared test of independence

The Chi-squared test is a hypothesis test used to determine whether there is a relationship between two categorical variables.

  • categorical variables can have a nominal or ordinal measurement scale
  • the observed frequencies are compared with the expected frequencies and their deviations are examined.
# Chi-squared test
# (Test of association to see if 
# H0: the 2 cat var (race  & actn3.r577x ) are independent
# H1: the 2 cat var are correlated in __some way__

tab <- table(famuss$race, famuss$actn3.r577x)
test_chi <- chisq.test(tab)

The obtained result (test_chi) is a list of objects…

You try…

…run View(test_chi) to check

Chi Squared test of independence (cont)

Within test_chi results there are:

  • Observed frequencies = how often a combination occurs in our sample
# Observed frequencies
test_chi$observed
            
              CC  CT  TT
  African Am  16   6   5
  Asian       21  18  16
  Caucasian  125 216 126
  Hispanic     4  10   9
  Other        7  11   5
  • Expected frequencies = what we would see if the 2 vars were PERFECTLY INDEPENDENT
# Expected frequencies
round(test_chi$expected  , digits = 1 )
            
                CC    CT    TT
  African Am   7.9  11.8   7.3
  Asian       16.0  24.1  14.9
  Caucasian  135.8 204.9 126.4
  Hispanic     6.7  10.1   6.2
  Other        6.7  10.1   6.2

Chi Squared test of independence (results)

  • Recall that
    • \(H_{0}\): the 2 cat. var. are independent
    • \(H_{1}\): the 2 cat. var. are correlated in some way
  • The result of Chi-Square test represents a comparison of the above two tables (observed v. expected):
    • p-value = 0.01286 smaller than α = 0.05 so we REJECT the null hypothesis
test_chi

    Pearson's Chi-squared test

data:  tab
X-squared = 19.4, df = 8, p-value = 0.01286

Computing Cramer’s V after test of independence

Recall that Cramer’s V is a way in which we can measure the effect size of the test of independence (i.e. a measure of the strength of association between two nominal variables)

  • V ranges from 0 to 1 (the smaller V, the weaker the association)

\[V=\sqrt{\frac{\chi^2}{n(k-1)}} \]

where:

  • \(V\) denotes Cramér’s V
  • \(\chi^2\) is the Pearson chi-square statistic from the aforementioned test;
  • \(n\) is the sample size involved in the test and
  • \(k\) is the lesser number of categories of either variable

Computing Cramer’s V after test of independence (2 ways)

  • ✍🏻 “By hand” first to see the steps
# Compute Cramer's V
 
# inputs 
chi_calc <- test_chi$statistic
n <- nrow(famuss) # N of obs 
n_r <- nrow(test_chi$observed) # number of rows in the contingency table
n_c <- ncol(test_chi$observed) # number of columns in the contingency table

# Cramer’s V
sqrt(chi_calc / (n*min(n_r -1, n_c -1)) )
X-squared 
0.1276816 
# Cramer’s V 
rstatix::cramer_v(test_chi$observed)
[1] 0.1276816

A Cramer’s V of 0.12 indicates a relatively weak association between the two categorical variables. It suggests that while there may be some relationship between the variables, it is not particularly strong.

Chi Squared test of goodness of fit

In this case, we are conducting a type of Pearson’s Chi-square test

  • used to test whether the observed distribution of a categorical variable differs from your expectations
  • the statistic is based on the discrepancies between observed and expected counts

Chi Squared test of goodness of fit (example)

Since the participants of the FAMuSS study were volunteers at a university, they did not come from a “representative” sample of the US population.

  • The \(\chi^{2}\) test can be used to test the \(H_{0}\) that the participants are racially representative of the general population

Race                  African American   Asian  Caucasian  Other  Total
FAMuSS (Observed)                   27      55        467     46    595
US Census (Expected)             76.16    5.95     478.38  34.51    595

We use the formula \(\chi^{2} = \sum_{k}\frac{(Observed - Expected)^{2}}{Expected}\),
under \(H_{0}\) = the sample proportions should equal the population proportions.

Chi Squared test of goodness of fit (example)

# Subset the vectors of frequencies from the 2 rows  
observed <- c(27,  55,  467, 46)
expected <- c(76.2,  5.95, 478.38,  34.51)

# Calculate Chi-Square statistic manually 
chi_sq_statistic <- sum((observed - expected)^2 / expected) 
df <- length(observed) - 1 
p_value <- 1 - pchisq(chi_sq_statistic, df) 

# Print results 
chi_sq_statistic
[1] 440.2166
df
[1] 3
p_value 
[1] 0

The calculated \(\chi^{2}\) statistic is very large, and the p_value is close to 0. Hence, there is more than sufficient evidence to reject the null hypothesis that the sample is representative of the general population.
A comparison of the observed and expected values (or the residuals) indicates that the largest discrepancy is with the over-representation of Asian participants.
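The same goodness-of-fit test can also be run directly with stats::chisq.test, passing the expected proportions via the p argument:

```r
observed <- c(27, 55, 467, 46)
expected <- c(76.16, 5.95, 478.38, 34.51)

# chisq.test() performs the goodness-of-fit test when given
# a vector of counts and a vector of expected proportions `p`
gof <- chisq.test(observed, p = expected / sum(expected))
gof$statistic  # ~440.2, matching the manual calculation (up to rounding)
gof$p.value    # ~0
```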

FROM CORRELATION/ ASSOCIATION TO CAUSATION

Visualize the data

We are mainly looking for a “vaguely” linear shape here

  • ggplot2 gives us a visual confirmation via linear best fit (the least squares regression line), using method = lm, & its 95% CI
ggplot(nhanes, aes (x = age, 
                          y = bmi)) + 
  geom_point() + 
  geom_smooth(method = lm,  
              #se = FALSE
              )

Linear regression models

The lm() function, used to fit linear models, has the following generic structure:

lm(y ~ x, data)

where:

  • the 1st argument y ~ x specifies the variables used in the model (here the model regresses a response variable \(y\) against an explanatory variable \(x\)).
  • The 2nd argument data names the dataframe containing the variables (needed only when the dataframe is not already specified in the first argument).

Linear regression models syntax

The following example shows fitting a linear model that predicts BMI from age (in years) using the nhanes adult sample (individuals 21 years of age or older from the NHANES data).

#fitting linear model
lm(nhanes$bmi ~ nhanes$age)
#equivalently...
lm(bmi ~ age, data = nhanes)

Call:
lm(formula = bmi ~ age, data = nhanes)

Coefficients:
(Intercept)          age  
   28.40113      0.01982  
  • Running the function creates an object (of class lm) that contains several components (model coefficients, etc.), either directly displayed or accessible via summary() or specific extractor functions.

Linear regression models syntax

We can save the model and then extract individual output elements from it using the $ syntax

# name the model object
lr_model <- lm(bmi ~ age, data = nhanes)

# extract model output elements
lr_model$coefficients
lr_model$residuals
lr_model$fitted.values

The command summary() returns these elements:

  • Call: reminds the equation used for this regression model
  • Residuals: a 5 number summary of the distribution of residuals from the regression model
  • Coefficients: displays the estimated coefficients of the regression model and the related hypothesis tests, given for:
    • intercept
    • explanatory variable(s) slope
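Since the full summary() printout is not shown above, here is a minimal sketch on simulated data (a hypothetical toy data frame standing in for the nhanes sample):

```r
set.seed(1)
toy <- data.frame(age = runif(100, min = 21, max = 80))
toy$bmi <- 25 + 0.05 * toy$age + rnorm(100, sd = 3)

toy_model <- lm(bmi ~ age, data = toy)
summary(toy_model)  # prints Call, Residuals, Coefficients, and fit statistics
```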

Linear regression models interpretation: coefficients

  • The model tests the null hypothesis that a coefficient is 0
  • Coefficients results display: estimate, std. error, t-statistic, p-value that corresponds to the t-statistic for:
    • intercept
    • explanatory variable(s) slope
  • In regression, the population parameter of interest is typically the slope parameter
    • here the slope for age does not appear significantly different from 0
summary(lr_model)$coefficients 
               Estimate Std. Error   t value      Pr(>|t|)
(Intercept) 28.40112932 0.96172389 29.531480 2.851707e-111
age          0.01981675 0.01824641  1.086063  2.779797e-01

Linear regression models interpretation: Coefficients 2

For the estimated coefficients of the regression model, we get:

  • Estimate = the average increase in the response variable associated with a one unit increase in the predictor variable, (assuming all other predictor variables are held constant).
  • Std. Error = a measure of the uncertainty in our estimate of the coefficient.
  • t value = the t-statistic for the predictor variable, calculated as (Estimate) / (Std. Error).
  • Pr(>|t|) = the p-value that corresponds to the t-statistic. If less than some alpha level (e.g. 0.05), the predictor variable is said to be statistically significant.
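To pull a single number out of that table, index the coefficient matrix by row and column name; a sketch on simulated (hypothetical) data:

```r
set.seed(1)
toy <- data.frame(age = runif(100, min = 21, max = 80))
toy$bmi <- 25 + 0.05 * toy$age + rnorm(100, sd = 3)
toy_model <- lm(bmi ~ age, data = toy)

# The coefficient table is a matrix: index it by row and column name
coefs <- summary(toy_model)$coefficients
coefs["age", "Estimate"]  # slope estimate
coefs["age", "Pr(>|t|)"]  # p-value for the slope
```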

Linear regression models outputs: fitted values

Here we see \(\hat{y}_i\), i.e. the fitted \(y\) value for the \(i\)-th individual

fit_val <- lr_model$fitted.values
# print the first 6 elements
head(fit_val)
       1        2        3        4        5        6 
29.39197 29.33252 29.31270 28.95600 29.39197 29.17398 

Linear regression models outputs: residuals

Here we see \(e_i = y_i - \hat{y}_i\), i.e. the residual value for the \(i\)-th individual

resid_val <- lr_model$residuals 
# print the first 6 elements
head(resid_val)
          1           2           3           4           5           6 
-1.49196704  0.06748322 -3.96270002 -3.15599844 -2.49196704  3.75601726 

Linear regression model’s fit: Residual standard error

  • The Residual standard error (an estimate of the parameter \(\sigma\)) tells us the average distance that the observed values fall from the regression line (we are assuming constant variance).
    • The smaller it is, the better the model fits the dataset!

We can compute it manually as:

\({\rm SE}_{resid}=\ \sqrt{\frac{\sum_{i=1}^{n}{(y_i-{\hat{y}}_i)}^2}{{\rm df}_{resid}}}\)

# Residual Standard error (Like Standard Deviation)

# ---  inputs 
# sample size
n = length(lr_model$residuals)
# n of parameters in the model
k = length(lr_model$coefficients) - 1 # subtract one to ignore the intercept
# degrees of freedom of the residuals 
df_resid = n - k - 1
# Sum of Squared Errors
SSE = sum(lr_model$residuals^2) # 22991.19

# --- Residual Standard Error
ResStdErr <- sqrt(SSE/df_resid)  # 6.815192
ResStdErr
[1] 6.815192

Linear regression model’s fit: \(R^2\) and \(Adj. R^2\)

The \(R^2\) tells us the proportion of the variance in the response variable that can be explained by the predictor variable(s).

  • if \(R^2\) is close to 0 -> the data are widely spread around the regression line
  • if \(R^2\) is close to 1 -> the data are tight around the regression line
# --- R^2
summary(lr_model)$r.squared
[1] 0.00237723

The \(Adj. R^2\) is a modified version of R-squared that has been adjusted for the number of predictors in the model. It is always lower than the R-squared.

  • it can be useful for comparing the fit of different regression models that use different numbers of predictor variables.
# --- Adj. R^2
summary(lr_model)$adj.r.squared
[1] 0.0003618303

Linear regression model’s fit: F statistic

The F-statistic indicates whether the regression model provides a better fit to the data than a model that contains no independent variables. In essence, it tests if the regression model as a whole is useful.

# extract only F statistic 
summary(lr_model)$fstatistic 
     value      numdf      dendf 
  1.179533   1.000000 495.000000 
# define function to extract overall p-value of model
overall_p <- function(my_model) {
    f <- summary(my_model)$fstatistic
    p <- pf(f[1],f[2],f[3],lower.tail=F)
    attributes(p) <- NULL
    return(p)
}

# extract overall p-value of model
overall_p(lr_model)
[1] 0.2779797

Given that the p-value is > 0.05, the predictor variable is not useful for predicting the value of the response variable
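For intuition, in a simple linear regression the F statistic is tied directly to \(R^2\). A sketch using the numbers printed above (copied, not recomputed from the data):

```r
# Sketch: recover the F statistic from R^2
# F = (R^2 / k) / ((1 - R^2) / df_resid)
r2       <- 0.00237723  # summary(lr_model)$r.squared (from the output above)
k        <- 1           # numerator df (number of predictors)
df_resid <- 495         # denominator df
F_stat <- (r2 / k) / ((1 - r2) / df_resid)
F_stat                                       # ~ 1.1795
pf(F_stat, k, df_resid, lower.tail = FALSE)  # ~ 0.278, the overall p-value
```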

DIAGNOSTIC PLOTS

The following plots help us assess whether (some of) the assumptions of linear regression are met!

(the independence assumption is more linked to the study design than to the data used in modeling)

Linear regression diagnostic plots: residuals 1/4

ASSUMPTION 1: there exists a linear relationship between the independent variable, x, and the dependent variable, y

For an observation \((x_i, y_i)\), where \(\hat{y}_i\) is the predicted value according to the line \(\hat{y} = b_0 + b_1x\), the residual is the value \(e_i = y_i - \hat{y}_i\)

  • A linear model is a particularly good fit for the data when the residual plot shows random scatter above and below the horizontal line.
    • (In this R plot, we look for a red line that is fairly straight)
# residual plot
plot(lr_model, which = 1 )


Linear regression diagnostic plots: normality of residuals 2/4

ASSUMPTION 2: The residuals of the model are normally distributed

With the quantile-quantile (Q-Q) plot we can check the normality of the residuals.

# quantile-quantile plot
plot(lr_model, which = 2 )


The data appear roughly normal, but there are deviations from normality in the tails, particularly the upper tail.

Linear regression diagnostic plots: Homoscedasticity 3/4

ASSUMPTION 3: The residuals have constant variance at every level of x (“homoscedasticity”)

This one is called a Scale-Location (or Spread-Location) plot: it shows whether the residuals are spread equally along the range of the predictor(s)

# Spread-location plot
plot(lr_model, which = 3 )


Test for Homoscedasticity

Besides visual check, we can perform the Breusch-Pagan test to verify the assumption of homoscedasticity. In this case:

  • \(H_{0}\): residuals are distributed with equal variance

  • \(H_{1}\): residuals are distributed with UNequal variance

  • we use the bptest function from the lmtest package

# Breusch-Pagan test against heteroskedasticity 
lmtest::bptest(lr_model)

    studentized Breusch-Pagan test

data:  lr_model
BP = 2.7548, df = 1, p-value = 0.09696
Because the test statistic (BP) is small and the p-value is not significant (p-value > 0.05): WE DO NOT REJECT THE NULL HYPOTHESIS of homoscedasticity
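For intuition, the studentized Breusch-Pagan statistic is \(n \times R^2\) of an auxiliary regression of the squared residuals on the predictor(s). A minimal sketch on simulated data (not the nhanes sample), so the resulting numbers are illustrative only:

```r
# Sketch of the studentized Breusch-Pagan test, by hand:
# regress the squared residuals on the predictor, then
# BP = n * R^2 of that auxiliary regression, compared to chi-square(k)
set.seed(42)
x <- rnorm(200)
y <- 2 + 0.5 * x + rnorm(200)   # homoscedastic by construction
m <- lm(y ~ x)
aux <- lm(residuals(m)^2 ~ x)   # auxiliary regression
BP  <- length(residuals(m)) * summary(aux)$r.squared
p   <- pchisq(BP, df = 1, lower.tail = FALSE)
c(BP = BP, p.value = p)         # p should be large for homoscedastic data
```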

Linear regression diagnostic plots: leverage 4/4

This last diagnostic plot is actually not referred to any assumptions but has to do with outliers:

  • a residuals vs. leverage plot allows us to identify influential observations in a regression model
    • The x-axis shows the “leverage” of each point and the y-axis shows the “standardized residual” of each point, i.e. “How much would the coefficients in the regression model change if a particular observation were removed from the dataset?”
    • Cook’s distance lines (red dashed lines) – not visible here – appear on the corners of the plot when there are influential cases
plot(lr_model, which = 5 )


In this particular case, there are no influential cases

(Small digression on the broom package)

  • The broom package introduces the tidy approach to regression modeling code and outputs, allowing us to convert/save them in the form of tibbles
  • the function augment attaches the model’s results (fitted values, residuals, etc.) to each observation
    • this is very useful for further use of such objects, like ggplot2 etc.
# render model as a dataframe 
broom::tidy(lr_model)
# A tibble: 2 × 5
  term        estimate std.error statistic   p.value
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)  28.4       0.962      29.5  2.85e-111
2 age           0.0198    0.0182      1.09 2.78e-  1
# see overall performance 
broom::glance(lr_model)
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>
1   0.00238      0.000362  6.82      1.18   0.278     1 -1658. 3322. 3335.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
# save an object with all the model output elements 
model_aug <- broom::augment(lr_model)

You try…

…run View(model_aug) to check


Useful resources:

  • https://www.statology.org/interpret-regression-output-in-r/
  • https://www.youtube.com/watch?v=ebHLMyqC2UY
  • https://www.learnbymarketing.com/tutorials/explaining-the-lm-summary-in-r/

Splitting the sample

If we seek any causal relationship between an explanatory and an outcome variable, we should split our sample to have:

  1. one sample to “train” a model on
  2. one sample to “test” the model on
  • Otherwise our model will seem better than it is (since it will be specifically built to “fit” our data)

  • The function rsample::initial_split will assist in that

nhanes_split <- rsample::initial_split(nhanes)
# default = 75%  of observations 
nhanes_train <- rsample::training(nhanes_split)
# default = 25%  of observations 
nhanes_test <- rsample::testing(nhanes_split)
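Under the hood, such a split is just a random partition of the row indices. A minimal base-R sketch on a toy data frame (not nhanes), mimicking the 75% / 25% default of initial_split:

```r
# Sketch: a 75% / 25% train-test split by hand
set.seed(123)                 # for a reproducible split
df <- data.frame(id = 1:100)  # toy data frame
train_idx <- sample(nrow(df), size = floor(0.75 * nrow(df)))
train <- df[train_idx, , drop = FALSE]
test  <- df[-train_idx, , drop = FALSE]
nrow(train)  # 75
nrow(test)   # 25
```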

Linear regression performance

We can start looking at how the model performs by applying it to our nhanes_test sub-sample, utilizing the function predict

# recall the dataset size 
nrow(nhanes_test) # 125 
[1] 125
# obtain 125 predicted observations 
predict(lr_model, nhanes_test)
       1        2        3        4        5        6        7        8 
29.17398 28.81728 29.07490 29.66940 29.51087 29.33252 29.53068 29.47123 
       9       10       11       12       13       14       15       16 
29.03527 29.60995 29.78830 29.31270 29.70904 29.41178 29.53068 29.05508 
      17       18       19       20       21       22       23       24 
29.41178 29.31270 28.81728 29.33252 29.84775 29.94684 29.11453 29.27307 
      25       26       27       28       29       30       31       32 
29.27307 29.51087 29.25325 29.68922 29.19380 29.39197 29.72885 29.80812 
      33       34       35       36       37       38       39       40 
28.99563 29.92702 29.17398 29.90720 29.27307 29.82794 29.82794 29.49105 
      41       42       43       44       45       46       47       48 
29.57032 29.29288 29.88739 29.35233 29.49105 29.01545 28.85691 29.27307 
      49       50       51       52       53       54       55       56 
28.97582 28.83710 29.55050 29.19380 29.86757 29.49105 29.59013 29.62977 
      57       58       59       60       61       62       63       64 
28.89655 29.25325 29.37215 29.33252 28.91636 29.62977 29.37215 28.83710 
      65       66       67       68       69       70       71       72 
29.33252 28.99563 28.97582 29.43160 28.81728 29.45142 29.98647 29.53068 
      73       74       75       76       77       78       79       80 
29.17398 29.62977 29.80812 29.27307 29.98647 28.81728 29.68922 29.05508 
      81       82       83       84       85       86       87       88 
29.43160 29.19380 29.35233 29.25325 29.41178 29.29288 29.43160 29.19380 
      89       90       91       92       93       94       95       96 
29.70904 29.43160 29.05508 29.74867 29.53068 29.23343 28.93618 29.98647 
      97       98       99      100      101      102      103      104 
28.97582 29.29288 29.11453 29.98647 29.07490 29.74867 29.47123 29.98647 
     105      106      107      108      109      110      111      112 
29.90720 29.53068 29.76849 29.74867 29.19380 29.43160 29.80812 29.39197 
     113      114      115      116      117      118      119      120 
29.86757 29.05508 29.70904 29.55050 29.35233 29.55050 29.57032 29.13435 
     121      122      123      124      125 
29.45142 29.35233 29.59013 29.23343 29.15417 

Linear regression performance: predicted values in test sample

  • We can look at the 95% confidence interval of the predicted mean values
# obtain 125 predicted observations + 95% confidence intervals 
pred_val_test <- predict(lr_model, nhanes_test, interval = "confidence")
# print out first 6 predicted values and CI boundaries 
head(pred_val_test)
       fit      lwr      upr
1 29.17398 28.45597 29.89199
2 28.81728 27.61741 30.01715
3 29.07490 28.24502 29.90478
4 29.66940 28.88607 30.45273
5 29.51087 28.87256 30.14917
6 29.33252 28.72249 29.94254
  • We can also look at the 95% prediction interval for a single new observation (note how much wider it is)
# obtain 125 predicted observations + 95% prediction intervals 
pred_val_test <- predict(lr_model, nhanes_test, interval = "prediction")
# print out first 6 predicted values and CI boundaries 
head(pred_val_test)
       fit      lwr      upr
1 29.17398 15.76447 42.58349
2 28.81728 15.37336 42.26120
3 29.07490 15.65894 42.49086
4 29.66940 16.25624 43.08257
5 29.51087 16.10539 42.91634
6 29.33252 15.92836 42.73668
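The two intervals answer different questions: the confidence interval reflects only the uncertainty of the fitted mean, while the prediction interval also adds the residual variance \(\sigma^2\), so it is always wider. A sketch on simulated data (not the nhanes sample):

```r
# Sketch: prediction intervals are wider than confidence intervals
set.seed(1)
x <- rnorm(50)
y <- 3 + 2 * x + rnorm(50)
m <- lm(y ~ x)
new <- data.frame(x = 0)
ci <- predict(m, new, interval = "confidence")  # CI for the mean response
pi <- predict(m, new, interval = "prediction")  # PI for a new observation
ci[, "upr"] - ci[, "lwr"]   # width of the confidence interval
pi[, "upr"] - pi[, "lwr"]   # width of the prediction interval (larger)
```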

Linear regression performance: RMSE

Basically we are asking: “how do the predictions compare to the actual test dataset?”

For this we take the differences between the predicted and the actual values and summarize them as the

RMSE = Root Mean Squared Error

\({\rm RMSE}=\sqrt{\frac{1}{n}\sum_{i=1}^{n}{(y_i-{\hat{y}}_i)}^2}\)

# --- testing sample 
RMSE <- sqrt(mean(
  (nhanes_test$bmi - predict(lr_model, nhanes_test))^2,
  na.rm = T))

RMSE # 7.779031
[1] 7.779031

This is quite close to the Residual standard error that we got from the regression model summary (6.815), even though that was computed on the training data while this comes from the testing data

Linear regression performance: \(R^2\)

#Multiple R-Squared (Coefficient of Determination)
SSyy=sum((nhanes_test$bmi - mean(nhanes_test$bmi, na.rm = T))**2)
SSE=sum(lr_model$residuals**2)
(SSyy-SSE)/SSyy
[1] NA
#Alternatively
1-SSE/SSyy
[1] NA

Both expressions return NA here: na.rm = T was applied only inside mean(), so the missing bmi values propagate through the outer sum(); moreover, SSE is taken from the training residuals while SSyy comes from the test sample, so the ratio would not be a valid test-sample \(R^2\) anyway.
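A consistent out-of-sample \(R^2\) computes the errors from predictions on the test sample (with na.rm applied to the sums). Since the nhanes objects are only assumed here, the sketch below uses simulated data:

```r
# Sketch: R^2 on a held-out test sample (simulated data)
set.seed(7)
x <- rnorm(150)
y <- 5 + 1.5 * x + rnorm(150)
train <- data.frame(x = x[1:100],   y = y[1:100])
test  <- data.frame(x = x[101:150], y = y[101:150])
m <- lm(y ~ x, data = train)
pred <- predict(m, test)
SSE  <- sum((test$y - pred)^2, na.rm = TRUE)  # errors on the TEST sample
SSyy <- sum((test$y - mean(test$y, na.rm = TRUE))^2, na.rm = TRUE)
1 - SSE / SSyy   # out-of-sample R^2
```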

Linear regression models outputs: model fit

str(summary(lr_model)) 
List of 12
 $ call         : language lm(formula = bmi ~ age, data = nhanes)
 $ terms        :Classes 'terms', 'formula'  language bmi ~ age
  .. ..- attr(*, "variables")= language list(bmi, age)
  .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. ..$ : chr [1:2] "bmi" "age"
  .. .. .. ..$ : chr "age"
  .. ..- attr(*, "term.labels")= chr "age"
  .. ..- attr(*, "order")= int 1
  .. ..- attr(*, "intercept")= int 1
  .. ..- attr(*, "response")= int 1
  .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
  .. ..- attr(*, "predvars")= language list(bmi, age)
  .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
  .. .. ..- attr(*, "names")= chr [1:2] "bmi" "age"
 $ residuals    : Named num [1:497] -1.492 0.0675 -3.9627 -3.156 -2.492 ...
  ..- attr(*, "names")= chr [1:497] "1" "2" "3" "4" ...
 $ coefficients : num [1:2, 1:4] 28.4011 0.0198 0.9617 0.0182 29.5315 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:2] "(Intercept)" "age"
  .. ..$ : chr [1:4] "Estimate" "Std. Error" "t value" "Pr(>|t|)"
 $ aliased      : Named logi [1:2] FALSE FALSE
  ..- attr(*, "names")= chr [1:2] "(Intercept)" "age"
 $ sigma        : num 6.82
 $ df           : int [1:3] 2 495 2
 $ r.squared    : num 0.00238
 $ adj.r.squared: num 0.000362
 $ fstatistic   : Named num [1:3] 1.18 1 495
  ..- attr(*, "names")= chr [1:3] "value" "numdf" "dendf"
 $ cov.unscaled : num [1:2, 1:2] 1.99e-02 -3.58e-04 -3.58e-04 7.17e-06
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:2] "(Intercept)" "age"
  .. ..$ : chr [1:2] "(Intercept)" "age"
 $ na.action    : 'omit' Named int [1:3] 352 356 366
  ..- attr(*, "names")= chr [1:3] "352" "356" "366"
 - attr(*, "class")= chr "summary.lm"
names(summary(lr_model))
 [1] "call"          "terms"         "residuals"     "coefficients" 
 [5] "aliased"       "sigma"         "df"            "r.squared"    
 [9] "adj.r.squared" "fstatistic"    "cov.unscaled"  "na.action"    
summary(lr_model)$sigma
[1] 6.815192
summary(lr_model)$df
[1]   2 495   2
summary(lr_model)$r.squared
[1] 0.00237723
summary(lr_model)$adj.r.squared
[1] 0.0003618303
summary(lr_model)$fstatistic
     value      numdf      dendf 
  1.179533   1.000000 495.000000 
summary(lr_model)$cov.unscaled  ##lm 
              (Intercept)           age
(Intercept)  0.0199133625 -3.582132e-04
age         -0.0003582132  7.168014e-06
